Markov Decision Process (MDP)
https://www.theschool.ai/courses/move-37-course/lessons/markovdecisionprocesses/

Things that are known as Markovian include:
* In probability theory and statistics, the Markov process and the Markov property, both named for Andrey Markov
* The Markovians, a vanished and mysterious god-like alien species from Jack L. Chalker's Well World novels

According to http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf:

"Markov decision processes formally describe an environment for reinforcement learning, where the environment is fully observable, i.e. the current state completely characterises the process. Almost all RL problems can be formalised as MDPs, e.g.
* Optimal control primarily deals with continuous MDPs
* Partially observable problems can be converted into MDPs
* Bandits are MDPs with one state"

According to https://en.wikipedia.org/wiki/Markov_decision_process:

A Markov decision process (MDP) is a discrete time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming and reinforcement learning. MDPs were known at least as early as the 1950s;[1] a core body of research on Markov decision processes resulted from Howard's 1960 book, Dynamic Programming and Markov Processes.[2] They are used in many disciplines, including robotics, automatic control, economics and manufacturing. The name of MDPs comes from the Russian mathematician Andrey Markov.

Move 37 article

Reinforcement learning problems are mathematically described using a framework called Markov decision processes (MDPs). An MDP is an extended version of a Markov chain that adds decisions (actions) and rewards to it.
The word Markov here refers to the Markov property, which means that the future state is independent of the history of previous states, given the current state and action. In other words, the current state encapsulates everything needed to determine the next state once an action is taken. This is a reasonable assumption in many problems, and it simplifies things a lot. For example, in a game of chess, the board configuration after a move can be determined from the current board configuration and the move being made; we don't need to worry about previous board configurations or past moves.

An MDP is a way of framing decision making in reinforcement learning, and it is commonly illustrated with a grid world whose states are the cells of a grid. The MDP captures such a world by dividing it into states, actions, a transition matrix, and rewards. The solution to an MDP is called a policy, and the objective is to find the optimal policy for the task the MDP models. Thus, any reinforcement learning task composed of a set of states, actions, and rewards that follows the Markov property is considered an MDP. In this tutorial, we will dig deep into MDPs, states, actions, rewards, and policies.

What is a State?
A State is a set of tokens that represent every condition that the agent can be in.

What is a Model?
A Model (sometimes called a Transition Model) gives an action's effect in a state. In particular, T(S, a, S') defines a transition T where being in state S and taking an action 'a' takes us to state S' (S and S' may be the same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S'|S,a), which represents the probability of reaching state S' if action 'a' is taken in state S.

What are Actions?
A is the set of all possible actions. A(S) defines the set of actions that can be taken while in state S.

What is a Reward?
A Reward is a real-valued response to an action.
R(S) indicates the reward for simply being in state S. R(S, a) indicates the reward for being in state S and taking an action 'a'. R(S, a, S') indicates the reward for being in state S, taking an action 'a', and ending up in state S'.

What is a Policy?
A policy is a solution to the Markov Decision Process. A policy is a mapping from states to actions: it indicates the action 'a' to be taken while in state S. A policy is denoted π, with π(S) → a. π* is called the optimal policy, which maximizes the expected reward. Among all possible policies, the optimal policy is the one that maximizes the amount of reward received, or expected to be received, over a lifetime. An MDP has no built-in end of the lifetime, so you have to decide the end time yourself. Thus, the policy is nothing but a guide telling which action to take for a given state. It is not a plan, but it uncovers the underlying plan of the environment by returning the action to take in each state.

A Markov Decision Process (MDP) is a tuple (S, A, T, r, γ):
* 'S': set of observations. The agent observes the environment state as one item of this set.
* 'A': set of actions. The agent chooses one action from this set to interact with the environment.
* 'T': P(s' | s, a), the transition probability matrix. This models what the next state s' will be after the agent takes action 'a' while in the current state 's'.
* 'r': P(r | s, a), the reward model. This models what reward the agent will receive when it performs action 'a' in state 's'. Rewards specify what the agent needs to achieve, not how to achieve it. (Source: Sutton and Barto, 2017)
* 'γ': the discount factor. This is a numerical value between 0 and 1 that represents the relative importance of immediate versus future rewards.
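To make the discount factor concrete, here is a minimal Python sketch. The function name discounted_return and the two reward sequences are hypothetical and chosen only for illustration; they are not taken from the course material.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of rewards r_0, r_1, ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical reward sequences (made-up numbers):
# "greedy" takes a large reward now and nothing afterwards,
# "patient" takes a small reward now and larger rewards later.
greedy = [5.0, 0.0, 0.0]
patient = [1.0, 3.0, 3.0]

for gamma in (0.1, 0.9):
    g = discounted_return(greedy, gamma)
    p = discounted_return(patient, gamma)
    better = "greedy" if g > p else "patient"
    print(f"gamma={gamma}: greedy={g:.2f}, patient={p:.2f}, prefer {better}")
```

With these made-up numbers, a small discount factor (0.1) favours the sequence with the large immediate reward, while a discount factor close to 1 (0.9) favours the sequence with the larger later rewards.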
For example, suppose the agent has to choose between two actions. One gives a high immediate reward but leads to a state from which the agent expects fewer future rewards. The other gives a lower immediate reward but leads to a state from which the agent expects more future rewards. The discount factor determines how the agent weighs these two options.

In a real-world scenario: a robotic vacuum cleaner, famously known as the Roomba, is a machine that cleans the floor. The Roomba needs to clean, avoid obstacles, and find the charging station. In this example, 4 states describe the possible positions of the robot, and the action describes the direction of motion. The robot can move to the left or to the right. The first (Battery Full) and the final (Charging) states are the terminal states. The goal is to find an optimal policy that maximizes the return from any initial state.

We can use the Markov Decision Process to frame this as a reinforcement learning problem. The states are the possible locations of the robot, the actions are the possible directions, and the rewards are either +1 or -1, depending on where the robot lands. The real question is, how do we find the optimal policy for our robot using this framework? That's a question we'll answer next week!
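The Roomba example can be sketched in Python as a tiny MDP. This is only an illustration under stated assumptions: the four positions are laid out in a row and numbered 0 to 3, moves are deterministic, entering state 0 yields -1, entering state 3 yields +1, and every other move yields 0. None of these details are specified above, and the names step and rollout are hypothetical. The sketch evaluates two fixed policies; it does not search for the optimal one.

```python
# Assumed layout: four positions in a row, indexed 0..3.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
TERMINAL = {0, 3}  # the first and the final state end the episode

# Assumption: which end gives +1 and which gives -1 is a guess,
# and moves that do not reach a terminal state give 0.
REWARD_ON_ENTRY = {0: -1.0, 3: 1.0}


def step(state, action):
    """Deterministic transition: return (next_state, reward)."""
    if state in TERMINAL:
        raise ValueError("episode already finished")
    next_state = state - 1 if action == "left" else state + 1
    reward = REWARD_ON_ENTRY.get(next_state, 0.0)
    return next_state, reward


def rollout(policy, start, gamma=0.9, max_steps=20):
    """Follow a policy (dict: state -> action) and return the discounted return."""
    state, total, discount = start, 0.0, 1.0
    for _ in range(max_steps):
        if state in TERMINAL:
            break
        state, reward = step(state, policy[state])
        total += discount * reward
        discount *= gamma
    return total


# Two fixed policies for the non-terminal states 1 and 2.
always_right = {1: "right", 2: "right"}
always_left = {1: "left", 2: "left"}

print(rollout(always_right, start=1))
print(rollout(always_left, start=1))
```

Under these assumptions, always moving right from position 1 earns a discounted return of 0.9 (the +1 arrives one step late and is discounted once), while always moving left earns -1.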